 deepfake speech


Fake Speech Wild: Detecting Deepfake Speech on Social Media Platform

Xie, Yuankun, Fu, Ruibo, Wang, Xiaopeng, Wang, Zhiyong, Li, Ya, Wen, Zhengqi, Cheng, Haonnan, Ye, Long

arXiv.org Artificial Intelligence

The rapid advancement of speech generation technology has led to the widespread proliferation of deepfake speech across social media platforms. While deepfake audio countermeasures (CMs) achieve promising results on public datasets, their performance degrades significantly in cross-domain scenarios. To advance CMs for real-world deepfake detection, we first propose the Fake Speech Wild (FSW) dataset, which includes 254 hours of real and deepfake audio from four different media platforms, focusing on social media. We then establish a benchmark using public datasets and advanced self-supervised learning (SSL)-based CMs to evaluate current CMs in real-world scenarios. We also assess the effectiveness of data augmentation strategies in enhancing CM robustness for detecting deepfake speech on social media. Finally, by augmenting public datasets and incorporating the FSW training set, we significantly advance real-world deepfake audio detection performance, achieving an average equal error rate (EER) of 3.54% across all evaluation sets.
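
The abstract reports results as an equal error rate but does not say how it is computed. As a rough illustration only, the EER is the operating point at which the false acceptance rate (deepfake audio accepted as real) equals the false rejection rate (real audio rejected as fake). The sketch below assumes that a higher score means "more likely real"; the function name `compute_eer` and the toy scores are hypothetical and are not taken from the FSW paper or its code.

```python
from typing import Sequence, Tuple


def compute_eer(
    bonafide_scores: Sequence[float],
    spoof_scores: Sequence[float],
) -> Tuple[float, float]:
    """Return (eer, threshold) for a score-based detector.

    Assumption: higher scores mean "more likely real (bona fide)".
    A trial is accepted as real when its score is >= threshold.
    """
    if not bonafide_scores or not spoof_scores:
        raise ValueError("both score lists must be non-empty")

    # Candidate thresholds: every observed score, plus one that rejects everything.
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    thresholds.append(float("inf"))

    best_gap = None
    best_eer = 0.0
    best_threshold = thresholds[0]
    for t in thresholds:
        # False rejection rate: real audio scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance rate: deepfake audio scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            # With finite data the two rates rarely cross exactly,
            # so report their midpoint at the closest threshold.
            best_gap, best_eer, best_threshold = gap, (far + frr) / 2, t
    return best_eer, best_threshold


# Toy scores, made up purely for illustration.
real = [0.9, 0.8, 0.7, 0.3]
fake = [0.1, 0.2, 0.4, 0.6]
eer, threshold = compute_eer(real, fake)
print(f"EER = {eer:.2%} at threshold {threshold}")
```

An "average EER across all evaluation sets", as quoted in the abstract, would under this reading be the mean of one such value per evaluation set. Evaluation toolkits typically interpolate between thresholds rather than taking the nearest one, so values from this sketch can differ slightly from published figures.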


Could YOU spot a deepfake? Scientists find humans struggle to detect AI speech even when they've been trained to look out for it

Daily Mail - Science & tech

Humans are unable to detect over a quarter of speech samples generated by AI, researchers have warned. Deepfakes are fake videos or audio clips intended to resemble a real person's voice or appearance. There are growing fears this kind of technology could be used by criminals and fraudsters to scam people out of money. Now, scientists have discovered people can only tell the difference between real and deepfake speech 73 per cent of the time. While early deepfake speech may have required thousands of samples of a person's voice to be able to generate original audio, the latest algorithms can recreate a person's voice using just a three-second clip of them speaking.